Voice Browsing Architecture Based on Adaptive Keyword Spotting

Inventor: Sorin Georgescu



1 TECHNICAL FIELD 

The present invention is applicable in the field of voice-controlled browsing and
multi-modal browsing of Web content from a dual mode voice&data Mobile Station (MS).

2 TECHNICAL BACKGROUND 

2.1 THE PROBLEM AREA 

Multi-modal browsing is a user friendly method that is used to access content over
the Internet. When accessing content with a multi-modal browser, a user may use
any of the supported input methods, or combinations thereof. Among input methods
implemented so far, the most frequent ones are the key stroke method and the voice
command method.

No architecture designed so far is capable of adding voice browsing functionality to
an ordinary user agent running in a dual mode voice&data MS. Existing voice
browsing systems are instead based on VoiceXML, a language capable of defining
voice dialogs. In a VoiceXML system, the voice browsing application (speech
browser) runs independently of the key-stroke browsing application. There is no
synchronisation between the two browsers. Furthermore, in case the content has not
been designed for both formats, i.e. HTML/XHTML and VoiceXML, there is no way to
implement multi-modal browsing.

2.2 STATE OF THE ART 

Voice browsing is presently implemented based on the VoiceXML paradigm.
VoiceXML is a language to define voice dialogs for Internet applications accessed
over the phone. In essence, output voice dialogs are carried out through audio and
text-to-speech prompts, while input dialogs are carried out through touch-tone keys
(DTMF) and automatic speech recognition.

A typical architecture consists of an Application Server hosting VoiceXML content,
and the VoiceXML Gateway containing the speech browser (VoiceXML client) and
the speech/telephony platform. The user-system interaction is done through a voice
menu, from which the user can specify his selection by voice. All functionality
regarding speech recognition, text-to-speech conversion, and DTMF recognition is
implemented in the speech/telephony platform, which converts to/from speech the
dialogs specified in the VoiceXML page. The speech browser, based on the content
interpreted on-the-fly, is the one that controls the sequence of voice dialogs. It is
important to mention that with this architecture, only the voice part of the MS is used
during the interaction with the user.

Read and Understood by:                    Date:

In the above architecture, multi-modal access to Internet applications is possible only if
both HTML/XHTML and VoiceXML formats are available on the Application Server.
The MS used has to be a dual mode voice&data station, in order to be able to
establish simultaneous voice and data sessions. Although theoretically possible,
multi-modal access to dual format content requires solving a major issue, as
presented in chapter 2.3.



2.3 PROBLEMS 



The architectures proposed so far for voice-based applications are centred around
the concept of voice dialogs usually defined in VoiceXML. Therefore, only the
application/content specifically designed for voice-based interaction may be
accessed over the phone. The huge base of HTML/XHTML content will never be
reachable, unless converted into VoiceXML.

Another problem with existing architectures is that when combining voice-based
access with normal browsing, with the purpose of implementing multi-modal
browsing, there is no mechanism that synchronises the two browsers, e.g. the
HTML/XHTML browser running in the data part of the MS and the speech browser
running in the VoiceXML Gateway. Unless a dedicated synchronisation mechanism
is implemented in the Application Server, the speech browser, and the MS user
agent, switching from one input method to the other during one and the same
browsing session is not possible.



3 THE INVENTION

3.1 SUMMARY 



The present invention overcomes the above problems by proposing an architecture
and a method for the selection of speech vocabulary keywords, providing a solution
to the access to an arbitrary HTML/XHTML page by means of voice commands.
Multi-modal browsing is thus implemented with no need to change the original
content.

3.2 DESCRIPTION 

The architecture described herein is based on a speech-enabled HTTP proxy. This
proxy, named in the below figure HTTP/Speech proxy, is an HTTP proxy enhanced
with voice browsing functionality. It is capable of extracting keywords from browsed
HTML/XHTML content as directed by predefined rules. The keywords are then
emphasized in the original content so that the user will know what words to use in his
speech commands when selecting a specific hyperlink.
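
As a rough illustration of the emphasis step, the sketch below underlines a
selected keyword inside the visible text of a hyperlink. It is only a minimal
sketch under stated assumptions: the function name underline_keyword is
hypothetical, the use of the <u> tag is one possible way of emphasizing, and the
function operates on plain link text rather than on a full HTML page.

```python
import re


def underline_keyword(link_text, keyword):
    """Wrap the first whole-word occurrence of `keyword` in <u>...</u>.

    Hypothetical helper: matching is case-insensitive and the original
    spelling of the word in the link text is preserved.
    """
    pattern = re.compile(r"\b(%s)\b" % re.escape(keyword), re.IGNORECASE)
    return pattern.sub(r"<u>\1</u>", link_text, count=1)


if __name__ == "__main__":
    print(underline_keyword("World news", "world"))
```

A real proxy would have to apply such a rewrite while parsing the page, so that
only text nodes are modified and the markup itself is left untouched.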

Due to the keyword spotting nature of the voice browsing functionality, the Automatic
Speech Recogniser (ASR) that the proxy interfaces with can be a middle size vocabulary
speech recogniser. Usually, this kind of ASR is capable of recognising continuous,
speaker independent speech. Therefore, no user training is required to set up a
system with the proposed architecture.

The rules used to extract the voice keywords can be grouped into:

• Syntactic rules - Rules like "use the subject and the predicate in a paragraph
associated with a single hyperlink". Several syntactic rules prioritised with regard
to the availability of keywords in the vocabulary may be used.

• Simple rules - Rules like "select a unique keyword in the hyperlink name, or in
the paragraph associated with it". Speech commands associated to a simple rule
may look like 'Go to X' or 'Go to the paragraph containing X'.

• Numeric rules - This refers to numbering hyperlinks in the page, or multiple
hyperlinks in the same paragraph. It can also be used for option selection in
menus.
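
The simple and numeric rules above can be sketched roughly as follows. This is
only an illustration under stated assumptions, not the method of the invention:
the names LinkCollector and select_keywords are hypothetical, only hyperlink
names are inspected (not the surrounding paragraphs), and the numeric rule is
used here as a fallback when no unique vocabulary word is found. Syntactic
rules are not covered.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, link name) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []      # list of (href, text) in page order
        self._href = None    # href of the <a> element currently open
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def select_keywords(html, vocabulary):
    """Return a dict mapping one voice keyword to each hyperlink.

    Simple rule: use a word of the link name that is in the ASR
    vocabulary and occurs in no other link name.
    Numeric rule (fallback): use the 1-based position of the link.
    """
    collector = LinkCollector()
    collector.feed(html)

    # Count in how many link names each word occurs.
    counts = {}
    for _, text in collector.links:
        for word in set(text.lower().split()):
            counts[word] = counts.get(word, 0) + 1

    keywords = {}
    for index, (href, text) in enumerate(collector.links, start=1):
        unique = [w for w in text.lower().split()
                  if w in vocabulary and counts[w] == 1]
        keywords[unique[0] if unique else str(index)] = href
    return keywords


if __name__ == "__main__":
    page = ('<p><a href="/news">World news</a> '
            '<a href="/sport">Sport news</a> <a href="/x">More</a></p>')
    print(select_keywords(page, {"world", "sport", "news"}))
```

With the sample page, "news" is rejected because it appears in two link names,
and the third link falls back to its number because "more" is not in the
assumed vocabulary.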

During the speech interaction, the speech proxy uses speech dialogs/prompts
whenever received commands are ambiguous. Standard text messages containing
keywords from the recognised vocabulary are forwarded to the Text To Speech (TTS)
block in the Telephony Platform. The TTS block converts the messages into speech,
which is then sent to the user through the voice channel. Speech dialogs may look
like: "Did you select the paragraph containing keyword X?". Due to the nature of
user-system interaction, the mobile stations operating with the proposed architecture
need to have support for concurrent voice and data sessions.

As a general rule, when implementing multi-modal browsing it is required that a
synchronisation engine exists between the HTTP browser in the user agent and the
speech browser, residing besides the voice browsing functionality.

This document does not address the implementation of the synchronisation engine;
instead it describes a possible way to embody this, to demonstrate that the proposed
architecture can handle multi-modal access to web content in a natural way. The
HTTP proxy contains a "push" mechanism that forces the user agent in the MS to
refresh the content, after having fetched the page indicated through voice
commands from the server. This could be based on a semaphore object (refresh
on/off) inserted in each returned page, and a script downloaded together with the
page. The script forces periodic updates of the semaphore value, thus allowing the
user agent to detect when a page refresh is requested by the speech proxy. Based
on the semaphore's value, the script can then trigger an entire page refresh, thus
downloading fresh content.
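
The semaphore-based "push" can be modelled roughly as shown below. This is a
sketch of one possible reading of the mechanism, with assumptions stated
plainly: the class and method names (SpeechProxy, PageScript, tick) are
hypothetical, the semaphore state is kept on the proxy side per session, and the
script that would run in the user agent is simulated by a plain Python object
whose tick() stands for one periodic update.

```python
class SpeechProxy:
    """Proxy-side view of the refresh semaphore (hypothetical names)."""

    def __init__(self):
        self._refresh = {}   # session id -> URL selected by voice

    def on_voice_selection(self, session_id, url):
        """Set the semaphore to 'on' after a voice command selected `url`."""
        self._refresh[session_id] = url

    def read_semaphore(self, session_id):
        """Answer one periodic update from the page script.

        Returns the URL to load if a refresh is requested, else None.
        Reading resets the semaphore to 'off'.
        """
        return self._refresh.pop(session_id, None)


class PageScript:
    """Stand-in for the script downloaded together with each page."""

    def __init__(self, proxy, session_id, url):
        self.proxy = proxy
        self.session_id = session_id
        self.current_url = url

    def tick(self):
        """One periodic update: refresh the page if the proxy asks for it."""
        target = self.proxy.read_semaphore(self.session_id)
        if target is None:
            return False
        self.current_url = target   # models the entire page refresh
        return True


if __name__ == "__main__":
    proxy = SpeechProxy()
    script = PageScript(proxy, "s1", "/home")
    proxy.on_voice_selection("s1", "/news")
    script.tick()
    print(script.current_url)
```

The polling period is a trade-off that the text leaves open: a short period
makes voice-selected pages appear sooner, at the cost of more traffic on the
data channel.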

In the following, the node interaction in the proposed architecture is described in
more detail:

1. The proxy connects to the ASR/Telephony platform, specifying the
application/vocabulary to connect to. This is very useful when an ASR hosts
several applications implementing "flavours" of the user-system speech interface
(type and sequence of voice prompts, allowed key strokes, vocabularies, etc.).

2. The ASR answers back with the words contained in the vocabulary and the
[...] supported by the Telephony platform ([...] voice ports, etc.). An ID of the
invoked application may be returned.

3. The subscriber opens a normal browsing session. In order to support voice
browsing, the following information needs to be stored in the proxy subscriber
profile: voice browsing On/Off, optional keyword used to trigger [...],
optional hyperlink name to be inserted in the accessed Web page, which upon
selection triggers the opening of the ASR-MS voice channel.

4. The proxy authenticates the user and checks whether or not voice browsing is
on. The HTTP request is then forwarded to the ASP. If voice browsing is on, the
proxy selects the method to open the ASR-MS voice channel based on the user profile.
This may be automatic (steps 8 and 9), or triggered by user selection of
some special HTTP link during browsing. In this document reference is made
only to the "automatic" establishment of the voice channel between the ASR and
the MS.

5. The ASP sends back the HTML/XHTML page. 

6. The proxy parses the content and analyses the paragraphs in the page using a
syntactic analyser in order to find meaningful keywords. Words in hyperlink
names may be selected as well as keywords. Selected keywords must be part of
the downloaded vocabulary, and must not lead to voice commands that are too close
to one another. The keywords are then highlighted in the page through, for example,
underlining. A keyword may be present in several multi-word voice commands,
provided there is enough discriminating information between them. For each
browsing session the following information should be stored: ID of the voice
browsing session, subscriber's MSISDN, and selected keywords.

7. The proxy sends the processed page to the MS user agent.

8. A voice browsing session is opened for the [...]. The
request should include the voice browsing session ID, the MSISDN of the user,
and the application ID if provided in step 2.

9. The Telephony platform performs a call to the MS using the specified MSISDN. A
voice channel concurrent with the data session channel is opened between the
ASR and the MS. This corresponds to the "automatic" opening of the voice
channel. Alternatively, the voice channel may be switched on/off manually,
through user selection of a special hyperlink. This scenario is not
further dealt with in the present document.

10. After the voice channel has been opened (i.e. the call is answered by the user),
the ASR returns the status data to the proxy.

11. The ASR forwards to the proxy the keywords recognised in the user's voice
commands. Each keyword is accompanied by [...]. On obtaining the keywords, the
proxy tries to match them to the ones selected in step 6. Should the commands
correspond to [...] forwarded keywords, or should voice confirmation be activated,
the proxy will send voice prompts to the ASR/Telephony platform. Based on the
answer received from the user, the proxy will later on decide which link to go to.
(Voice prompting is not represented in the above diagram for the sake of
simplicity.)

12. Using the link obtained in step 11, the proxy sends a GET request to the ASP.

13. After receiving the reply, the content is processed in a similar way as in step 6.

14. The proxy "pushes" the page to the user agent using a mechanism similar to the 
one described above. 
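
The per-session data stored in step 6 and the keyword matching of step 11 can be
sketched as follows. This is a minimal sketch under stated assumptions: the
names VoiceSession and resolve_command are hypothetical, keywords are assumed to
map directly to hyperlink URLs, an ambiguous command is taken to mean that the
recognised keywords point to more than one link, and the "Please repeat your
selection." prompt is invented for illustration. Only the confirmation prompt
wording is taken from the description.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceSession:
    """Data kept by the proxy for one voice browsing session (step 6)."""
    session_id: str
    msisdn: str
    keywords: dict = field(default_factory=dict)   # keyword -> hyperlink URL


def resolve_command(session, recognised):
    """Match keywords received from the ASR against the stored ones (step 11).

    Returns ("go", url) when exactly one hyperlink matches, otherwise
    ("prompt", text) with a message to be forwarded to the TTS block.
    """
    urls = []
    matched = []
    for word in recognised:
        url = session.keywords.get(word.lower())
        if url is not None:
            matched.append(word.lower())
            if url not in urls:
                urls.append(url)

    if len(urls) == 1:
        return ("go", urls[0])   # step 12 would send a GET for this link
    if not urls:
        # Hypothetical prompt text, not taken from the description.
        return ("prompt", "Please repeat your selection.")
    return ("prompt",
            "Did you select the paragraph containing keyword %s?" % matched[0])


if __name__ == "__main__":
    # The MSISDN below is a placeholder value.
    session = VoiceSession("s1", "0000000000",
                           {"world": "/news", "sport": "/sport"})
    print(resolve_command(session, ["go", "to", "world"]))
```

Words such as "go" and "to" are simply ignored here because they are not stored
keywords, which mirrors the keyword spotting approach in a very simplified form.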

The proposed voice browsing architecture and method of keyword selection solve in a
[...] the issue of synchronisation between the MS user agent and the voice
[...] user input is always in sync. Regarding the vocabulary recognised by the ASR, a
middle size vocabulary adjusted to the most frequently used words in the [...]
of around [...] words is probably good enough. Standard speech
queries/prompts on less suggestive keywords in the paragraphs can be used in case
[...] are not part of the recognised vocabulary. VoiceXML
may be a good option to define such speech queries/prompts.

3.3 BENEFITS



This invention paves the way for voice browsing of any web page, eliminating the
need of content translation into VoiceXML or other equivalent language. It also
provides the framework for multi-modal browsing, an issue that has not been
addressed by state of the art systems. Because of the keyword spotting approach it
allows the user to use more natural speech queries than those presently used in
VoiceXML dialogs.



3.4 BROADENING 



In case future terminals will use Voice over IP (VoIP) for the voice services, the
present invention can constitute the basis for a voice browsing proxy using the data
channel only.



4 CLAIM 

An arrangement for concurrent multi-modal access of an internet page
characterized in that

- the accessed page is parsed with regard to key text elements, words, and
phrases, as defined by certain pre-set rules, and interpreted by a voice
proxy (voice mode)

- the accessed page is browsed by means of key strokes (key stroke
mode)

- the accessed page contains one set of tags alone, namely XML or the
like, and thus is not in need of dual tagging ("one format fits all")